you will be able to use it from any directory. While you are in the “bwa” directory, run the
“pwd” command to print the absolute path of BWA and copy it. Then change to your home
directory and open the “.bashrc” file using “vim” or any text editor of your choice:
cd $HOME
vim .bashrc
Add the following to the end of the “.bashrc” file:
export PATH="your_path/bwa:$PATH"
Do not forget to replace “your_path” with the path to the “bwa” directory on your computer.
Save the “.bashrc” file, exit, and restart the terminal (or run “source ~/.bashrc”) for the
change to take effect.
Type “bwa” on the terminal and press the enter key. If the BWA software was installed and
added to the path correctly, you will see the help screen.
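The manual edit of “.bashrc” described above can also be scripted. The following is a minimal sketch, not part of BWA itself: the function name “add_to_path” is hypothetical, and it assumes a Bash-compatible shell. It appends the export line only if that exact line is not already in the file, so running it twice does not duplicate the entry.

```shell
# Hypothetical helper: append a directory to PATH in a shell startup file,
# but only if that exact export line is not already present.
add_to_path() {
    dir="$1"                      # directory to add, e.g. the output of pwd
    rcfile="${2:-$HOME/.bashrc}"  # startup file to edit (defaults to ~/.bashrc)
    line="export PATH=\"$dir:\$PATH\""
    # grep -qxF: quiet, whole-line, fixed-string match
    grep -qxF "$line" "$rcfile" 2>/dev/null || printf '%s\n' "$line" >> "$rcfile"
}

# Example usage (run from inside the bwa directory):
#   add_to_path "$(pwd)"
#   source ~/.bashrc
#   command -v bwa    # prints the resolved location if bwa is found on PATH
```

The “command -v bwa” check is a quick way to confirm that the shell can locate the program before you rely on it in later steps.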
BWA has three alignment algorithms: BWA-MEM (“bwa mem”), BWA-SW (“bwa bwasw”), and
BWA-backtrack (“bwa aln/samse/sampe”). BWA-backtrack is designed for Illumina short
reads of up to 100 bp, whereas “bwa mem” and “bwa bwasw” can map both short and long
sequences produced by any of the sequencing technologies. Of the three algorithms,
“bwa mem” is generally the most accurate and the fastest, and it is the usual choice.
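To make the difference between the three algorithms concrete, the sketch below prints, without running, the typical command lines for each one on single-end reads. The function name “bwa_command” and the file names “reads.fq”, “reads.sai”, and “aln.sam” are placeholders chosen for illustration; only the “bwa” subcommands themselves come from BWA.

```shell
# Print (do not run) the BWA command line(s) for a given algorithm.
# All file names are placeholders supplied by the caller.
bwa_command() {
    algo="$1"; ref="$2"; reads="$3"; out="${4:-aln.sam}"
    case "$algo" in
        mem)   echo "bwa mem $ref $reads > $out" ;;
        bwasw) echo "bwa bwasw $ref $reads > $out" ;;
        backtrack)
            # BWA-backtrack needs two steps for single-end reads:
            # "aln" finds the alignment coordinates, "samse" writes the SAM file
            echo "bwa aln $ref $reads > reads.sai"
            echo "bwa samse $ref reads.sai $reads > $out" ;;
        *) echo "unknown algorithm: $algo" >&2; return 1 ;;
    esac
}

# Show the command for aligning single-end reads with BWA-MEM
bwa_command mem GRCh38.p13_ref.fna reads.fq
```

Note that BWA-backtrack is the only one of the three that requires an intermediate “.sai” file; for paired-end reads, “bwa sampe” takes the place of “bwa samse”.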
Aligning read sequences to a reference genome with BWA first requires indexing the
reference genome with the “bwa index” command. We can use this command to index the
human reference genome, which was downloaded and indexed with “samtools faidx” above,
as follows:
bwa index GRCh38.p13_ref.fna
The indexing will take some time, depending on the size of the reference genome and the
memory of your computer. When the “bwa index” command finishes, it displays summary
information, including the number of iterations, the elapsed time in seconds, the
indexed FASTA file name, and the real time and CPU time taken for the indexing process.
Indexing the human genome may take up to six hours on a desktop computer with
32 GB of RAM.
The BWA indexing process creates five index files with the extensions “.amb”, “.ann”,
“.bwt”, “.pac”, and “.sa”. The total storage space for the current human reference genome
and its index files is around 9.4 GB.
The “.amb” file indexes the locations of the ambiguous (unknown) bases in the FASTA
reference file, that is, bases flagged as N or any character other than A, C, G, or T. The
“.ann” file contains annotation information such as sequence IDs and chromosome numbers.
The “.bwt” file is a binary file for the Burrows–Wheeler transformed sequence. The “.pac”
file is a binary file for the packed reference sequence. The “.sa” file is also a binary file,
containing the suffix array index. For mapping read sequences to the reference genome, all
five files must be kept together in the same directory.
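Because a missing index file will cause mapping to fail, it can be useful to check that all five files are present before aligning. The following is a minimal sketch with a hypothetical function name, “bwa_index_complete”; it assumes the index was built with the default prefix, so that each index file is named after the FASTA file with the extension appended (for example “GRCh38.p13_ref.fna.bwt”).

```shell
# Return 0 if all five BWA index files exist next to the given FASTA file,
# otherwise report the first missing file and return 1.
bwa_index_complete() {
    ref="$1"
    for ext in amb ann bwt pac sa; do
        if [ ! -f "$ref.$ext" ]; then
            echo "missing index file: $ref.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Example usage: build the index only when it is incomplete.
#   bwa_index_complete GRCh38.p13_ref.fna || bwa index GRCh38.p13_ref.fna
```

Given how long indexing the human genome can take, a check of this kind also avoids rebuilding an index that already exists.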